Twitter corpus creation: The case of a Malay Chat-style-text Corpus (MCC)

نویسندگان

  • Mohammad Arshi Saloot
  • Norisma Idris
  • AiTi Aw
  • Dirk Thorleuchter
چکیده

In recent years, social networks, microblogs, and short message service have deeply penetrated peoples lives, and thus, chat-style text is a common phenomenon. This chat-style text has many unknown features for linguists, which can be discovered by analyzing a chat-style corpus. The process of constructing a corpus conforms to specific corpus criteria, such as representativeness, sampling, variety, and chronology. Up to now, literature does not provide specific corpus criteria for creating a chat-style-text corpus. In contrast to related work, corpus criteria for creating a chat-style corpus are provided. An exhaustive and reliable Malay chat-style text corpus is still lacking. Thus, the provided criteria are used to demonstrate the process of constructing a Twitter corpus known as the Malay Chat-style Corpus (MCC). The MCC, which has 1 million twitter messages, consists of 14,484,384 word instances, 646,807 terms and metadata, such as posting time, used twitter client application, and type of Twitter message (simple Tweet, Retweet, Reply). Furthermore, the results of the analysis of the MCC reveal characteristics of the corpus including the most frequent terms and collocations, Zipf law diagram, Twitter peak hours, and percentages of message types. Finally, representativeness of the corpus is evaluated by employing cartography and automatic language identification methods. This corpus and the process of corpus creating are valuable for researchers working in linguistics, natural language processing, and data mining. ................................................................................................................................................................................. Correspondence: Mohammad Arshi Saloot, Department of Artificial Intelligence, Faculty of Computer Science & Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia. E-mail: [email protected] Digital Scholarship in the Humanities, Vol. 31, No. 2, 2016. The Author 2014. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: [email protected] 227 doi:10.1093/llc/fqu066 Advance Access published on 14 December 2014 by gest on July 8, 2016 http://dshrdjournals.org/ D ow nladed from

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

TwiSty: A Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling

Personality profiling is the task of detecting personality traits of authors based on writing style. Several personality typologies exist, however, the Myers-Briggs Type Indicator (MBTI) is particularly popular in the non-scientific community, and many people use it to analyse their own personality and talk about the results online. Therefore, large amounts of self-assessed data on MBTI are rea...

متن کامل

Anomaly Detecting Within Dynamic Chinese Chat Text

The problem in processing Chinese chat text originates from the anomalous characteristics and dynamic nature of such a text genre. That is, it uses ill-edited terms and anomalous writing styles in chat text, and the anomaly is created and discarded very quickly. To handle this problem, one solution is to re-train the recognizer periodically. This costs a lot of manpower in producing the timely ...

متن کامل

Corpus Design for Malay Corpus-based Speech Synthesis System

Problem statement: Speech corpus is one of the major components in corpus-based synthesis. The quality and coverage in speech corpus will affect the quality of synthesis speech sound. Approach: This study proposes a corpus design for Malay corpus-based speech synthesis system. This includes the study of design criteria in corpus-based speech synthesis, Malay corpus based database design and the...

متن کامل

Arabizi Identification in Twitter Data

In this work we explore some challenges related to analysing one form of the Arabic language called Arabizi. Arabizi, a portmanteau of Araby-Englizi, meaning Arabic-English, is a digital trend in texting Non-Standard Arabic using Latin script. Arabizi users express their natural dialectal Arabic in text without following a unified orthography. We address the challenge of identifying Arabizi fro...

متن کامل

An architecture for Malay Tweet normalization

Research in natural language processing has increasingly focused on normalizing Twitter messages. Currently, while different well-defined approaches have been proposed for the English language, the problem remains far from being solved for other languages, such as Malay. Thus, in this paper, we propose an approach to normalize the Malay Twitter messages based on corpus-driven analysis. An archi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • DSH

دوره 31  شماره 

صفحات  -

تاریخ انتشار 2016